To show pandas in a more "applied" sense, let's use it to answer some questions about the MovieLens dataset.
The dataset contains 100,000 ratings made by 943 users on 1,682 movies.
The MovieLens data is a good example for this because it has a lot of inter-ralationships that will requiere joining/grouping :
users
and ratings
are linked together by a key (in this case, the user_id
and movie_id
).
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
In [3]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_cols)
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_cols)
# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_cols, usecols=range(5))
# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)
In [5]:
print movies.head(3), '\n'
print ratings.head(3), '\n'
print users.head(3)
In [6]:
print lens.head()
In [9]:
lens.describe()
Out[9]:
In [5]:
most_rated = lens.groupby('title').size().order(ascending=False)[:10]
print most_rated
using .value_count()
In [7]:
# using .value_count()
lens.title.value_counts()[:10]
Out[7]:
size
and mean
to each group using .agg()
methodWe can use the .agg()
method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.
In [25]:
movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})
print movie_stats.head()
Let's sort the resulting DataFrame so that we can see which movies have the highest average score.
Because movie_stats
is a DataFrame, we use the sort
method (only Series objects use order
).
Additionally, because our columns are now a MultiIndex, we need to pass in a tuple specifying how to sort.
In [26]:
print movie_stats.sort([('rating', 'mean')], ascending=False).head()
The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.
In [55]:
atleast_100 = movie_stats['rating']['size'] > 100
print movie_stats[atleast_100].sort([('rating', 'mean')], ascending=False).head()
Limiting our population going forward
Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.
In [8]:
most_50 = lens.groupby('movie_id').size().order(ascending=False)[:50]
Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.
In [9]:
users.age.hist(bins=30)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('age');
In [10]:
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
bins = range(0, 81, 10) # [0, 10, 20, 30, 40, 50, 60, 70, 80]
lens['age_group'] = pd.cut(lens.age, bins, right=False, labels=labels)
print lens[['age', 'age_group']].drop_duplicates()[:5] # preview of age bin
Now we can now compare ratings across age groups.
In [62]:
print lens.groupby('age_group').agg({'rating': [np.size, np.mean]})
Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the most_50
Series we created earlier for filtering.
In [63]:
lens.set_index('movie_id', inplace=True)
In [64]:
by_age = lens.ix[most_50.index].groupby(['title', 'age_group'])
by_age.rating.mean().head(15)
Out[64]:
Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.
Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.
Behold! The magic of unstack
!
In [66]:
by_age.rating.mean().unstack(1)[10:20]
Out[66]:
unstack
, well, unstacks the specified level of a MultiIndex (by default, groupby
turns the grouped field into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second index (remember that Python uses 0-based indexes), and then filled in NULL values with 0.
If we would have used:
by_age.rating.mean().unstack(0).fillna(0)
We would have had our age groups as rows and movie titles as columns.
Which movies do men and women most disagree on?
DataFrame's have a pivot_table method that makes these kinds of operations much easier (and less verbose).
In [67]:
lens.reset_index('movie_id', inplace=True)
In [78]:
pivoted = pd.pivot_table(lens, values='rating', index=['movie_id', 'title'],
... columns=['sex'], fill_value=0)
print pivoted.head()
In [79]:
pivoted['diff'] = pivoted.M - pivoted.F
print pivoted.head()
In [80]:
pivoted.reset_index('movie_id', inplace=True)
In [81]:
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff']
disagreements.order().plot(kind='barh', figsize=[9, 15])
plt.title('Male vs. Female Avg. Ratings\n(Difference > 0 = Favored by Men)')
plt.ylabel('Title')
plt.xlabel('Average Rating Difference');